The simplest formal "neuron" models are elemental processing units
with n inputs and a single output. To generate this output, the unit first computes a weighted sum of the n inputs {x1, x2, ..., xn}, a scalar function of the input vector. The resulting scalar value is sometimes called the net input, or net. This value is then mapped to the output through either a discrete thresholding (step) function or a continuous transfer function, usually having the shape of a sigmoid (Fig. 1).
The unit has an independent offset, or bias, which is treated as an additional input of constant value (-1) with weight w0.
When discrete units with step functions are used, w0 becomes a threshold value. This extra input is useful when learning algorithms are considered: it allows the bias or threshold to be learned like any other weight.
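As a concrete illustration (sketched here in Python rather than the notebook's Mathematica, with illustrative weights), a unit with the dummy bias input and either transfer function might look like:

```python
import math

def unit_output(x, w, transfer="sigmoid"):
    # Prepend the constant dummy input (-1) so the bias w[0]
    # is treated like any other weight.
    x_ext = [-1.0] + list(x)
    net = sum(wi * xi for wi, xi in zip(w, x_ext))  # the net input
    if transfer == "step":
        return 1.0 if net >= 0 else 0.0             # discrete threshold unit
    return 1.0 / (1.0 + math.exp(-net))             # continuous sigmoid unit

# With bias w0 = 0.5, the step unit fires only when x1 + x2 >= 0.5.
print(unit_output([1.0, 1.0], [0.5, 1.0, 1.0], transfer="step"))  # 1.0
print(unit_output([0.0, 0.0], [0.5, 1.0, 1.0], transfer="step"))  # 0.0
```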
We will start our exploration of Artificial Neural Networks with these continuous units.
Q: What is the utility of such a unit, by itself or in a network?
A: Each unit is an adjustable function generator, if we assume that the summing parameters, i.e., the weights, can be varied. With the additional dummy input (-1), the value of the net input is the Dot (inner) product of the extended input vector with the weight vector; both have the same dimension (n+1). If these vectors are normalized to unit length, the net input equals the cosine of the angle between them, hence a measure of similarity between the input vector or "pattern" and the weight vector. With purely linear units, the processing of input data as well as the recipe for weight learning are analytically tractable. Considerable insight will be gained by examining the underlying theory, as will be done in the next section.
When a nonlinear transfer function is introduced, the output of each processing unit becomes constrained by the shape of this function. With a monotonic continuous function such as the previously mentioned sigmoid, it is easy to see how the unit becomes a classifier. This is clearest when the transfer function is a thresholding step: the output can only assume one of two values, (0,1) or (+1,-1), according to the sign of the net input. The dividing boundary between the respective regions of positive and negative outputs is a hyperplane in the input space. Hence the unit can classify data vectors (points) in the input space into two separate categories. The change of sign and the hyperplane boundary also hold with continuous sigmoid units, since a sigmoid crosses its midpoint exactly at the threshold and is approximately linear around it. In this case the value of the output is continuous through the boundary and approaches saturation at some distance from the hyperplane.
Q: How does one find the weights that produce the right divider ?
A: This is the "Learning" problem. At this point we will be concerned with "supervised learning", i.e., learning from examples. The examples are input vectors paired with a "desired" output. This collection is often called a "training dataset", or simply a training set.
We will start with a purely linear transfer function to get a theoretical foothold on the problem. Then, if all the training data can be given in advance (batch), the problem has a theoretical solution from linear algebra: the pseudo-inverse operator yields weights such that the input/output pairs produced by the unit match, in the mean-square sense, the data pairs in the training set.
An equivalent strategy is to follow the gradient over the error surface defined globally over the weight space by the training set. The surface is a paraboloid, and gradient descent will therefore lead to the (unique) minimum. Unlike the pseudo-inverse method, the gradient descent approach can be applied to nonlinear continuous units as well. The error surface, however, is specific to each training set and can only be correctly calculated if all the training pairs are used in computing the error function.
If the training examples and weight corrections occur one at a time, we can still obtain an estimate of the error surface and use the gradient to guide us toward the minimum. This is sometimes called the stochastic approximation technique in system theory. The weights are learned by presenting the data repetitively, one example at a time and typically several times over. This approach has robust properties; however, convergence can no longer be guaranteed, a limitation that pervades all neural network methods.
We first observe that if we were given three arbitrary input-output pairs produced by a linear unit, then these three points on the planar output surface uniquely define a plane. Since the parameters in the plane equation are the three weights, we should be able to recover their values by applying the Inverse operator. Actually, the equation is invariant if these values are multiplied by a positive constant, but normalization will render them unique.
To illustrate the process, let us generate the outputs produced by three arbitrary input vectors, each augmented by the dummy input (-1) that allows for the bias:
Since we have three vector equations of the form y = w.x, and if we arrange the input vectors x properly in a 3x3 matrix X (this will require transposing inputVectors), we can write w = y.X^(-1). Hence, applying the Inverse operator, we should recover the weightVector:
The result should match the previously chosen weightVector; this can be checked by showing that the difference between the two vectors is equal to, or very close to, zero.
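A minimal Python sketch of this recovery step (the weights, the three input vectors, and the small Gaussian-elimination solver are illustrative stand-ins for the notebook's Inverse computation):

```python
def solve(A, y):
    # Gauss-Jordan elimination with partial pivoting; A is n x n, y has length n.
    n = len(A)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]

# Hypothetical weights (bias first) and three extended inputs (-1 prepended).
w_true = [0.3, -0.7, 0.5]
X = [[-1.0, 0.2, 0.9], [-1.0, -0.4, 0.1], [-1.0, 0.8, -0.6]]
y = [sum(wi * xi for wi, xi in zip(w_true, x)) for x in X]
w_rec = solve(X, y)   # inputs as rows here, so we solve X.w = y directly
print([round(w, 6) for w in w_rec])   # [0.3, -0.7, 0.5]
```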
Linear Regression: Calculating Best Weights over a set of sample points
:[font = text; inactive; preserveAspect]
With three input vectors, the weight identification problem solves exactly. It is neither overconstrained nor underconstrained, since there is one and only one plane passing through three points, unless the points are collinear.
We now will generalize this analytical solution to the overconstrained case, i.e., when there is a set of input vectors and "desired outputs" and the best approximating plane is sought, in the space defined by the input-output pairs. In other words, given a set of input-output pairs, we will seek to calculate the weights that generate an "optimal" plane through the points defined by the input vectors and corresponding outputs.
This is also called "linear regression": finding the best plane that fits data vectors when there are more than (n+1) points where n is the dimensionality of the input vectors. The data points need not, and generally will not fall on the resulting plane and may not even form an approximate planar arrangement.
A theoretical solution will be obtained if we interpret "optimal" as meaning that the plane defined by the optimal weights found is such that the mean square distance from all data points to the plane is minimized (Least Mean Squared or LMS).
Let us first pick numPoints data points at random and plot the result:
We will now create output data. First we will generate the values that fall exactly on the planar surface defined by the previously chosen set of weights:
Linear Algebra tells us that the best linear match to these can be calculated as previously by using the PseudoInverse operator. At this time, since we have outputs that fall exactly on the theoretical plane, we should exactly or almost exactly recover the weightVector. The Pseudo-Inverse works in this case like the Inverse did earlier:
Again, applying the pseudo-inverse operator, we should get the LMS weight vector, which should not be very far from the original, since the points still lie near the original plane.
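The overconstrained case can be sketched the same way in Python; here the pseudo-inverse solution is computed via the normal equations, and the data set (twenty points whose outputs lie exactly on the plane) is an illustrative assumption:

```python
import random

def lstsq(X, y):
    # Normal equations: w = (X^T X)^(-1) X^T y  -- the pseudo-inverse solution.
    n = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    # Gauss-Jordan elimination on the small n x n system.
    M = [row[:] + [b] for row, b in zip(XtX, Xty)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]

random.seed(1)
w_true = [0.3, -0.7, 0.5]                        # bias first, as before
X = [[-1.0, random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(20)]
y = [sum(wi * xi for wi, xi in zip(w_true, x)) for x in X]
w_lms = lstsq(X, y)   # outputs lie exactly on the plane, so w_lms ~ w_true
```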
Lyapunov Functions, Error Gradients and Steepest Descent: Iterative Regression.
:[font = text; inactive; preserveAspect]
Q: Since weights can be calculated, why bother with learning?
A: The previous method is usually not applicable.
The first reason is that the pseudo-inverse approach is a batch method, i.e., you must have all the data at once. In a typical learning situation, data comes one point at a time, and one will want to get successive approximations of the optimal weights. The second reason is of course that this batch method works only in the linear case. It will not accommodate units that have a nonlinear transfer function such as the sigmoid.
Learning can be stated as an optimization problem. If we can find a global cost function of the weights whose minimization leads to the best set of weights (a Lyapunov function), we have a basis for successive approximation (the steepest descent). A natural candidate is quadratic function, the mean square of the difference between desired (given in the training set) and observed outputs, summed over the whole data set. To calculate this function aeta requires that all the training data be available at the onset. In the linear case, it can be shown that the pseudo-inverse operator finds weights that minimize this particular function.
The shape of this error function is that of a hyperparaboloid with a unique minimum. Its ellipticity is affected by the amount of covariance between input components. We can check the statistics, represented by the covariance matrix, of the input data above. First we observe that, for a uniform distribution over the {-1,+1} range, the expected value of the variance (diagonal) terms is 1/3 while the covariance terms should be zero.
Of course we should expect to be somewhat off from these theoretical values because 1) we have only a few points and 2) the pseudo-random generator is bound to generate unwanted correlations.
Multiplying the (numPoints by 2) matrix formed from inputVectors by its transpose and normalizing will yield the data correlation matrix.
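For instance (an illustrative check in Python), drawing many points uniformly from the {-1,+1} range and forming the normalized product of the data matrix with its transpose gives diagonal terms near 1/3 and off-diagonal terms near zero:

```python
import random

random.seed(0)
num_points = 2000
# Inputs uniform over [-1, 1], matching the symmetric data generator.
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(num_points)]
# Correlation matrix (1/N) X^T X: diagonal ~ 1/3, off-diagonal ~ 0.
C = [[sum(x[i] * x[j] for x in X) / num_points for j in range(2)]
     for i in range(2)]
print(C)
```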
The error function is a scalar function of the weight vectors . We will make it visualizable by looking only at the dependency on w1 and w2. We will leave the bias w0 fixed to the value just found. (It should be close to zero anyway since the data generator is symmetrical in the +1,-1 range.)
The cost is the sum of the squared differences between the given (desired) outputs and the computed output, expressed as a function of the current values of the weights w1 and w2. Here is an expression for it, using the Sum primitive.
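In Python, that expression might be sketched as follows (the fixed-bias handling and the sample data are illustrative; the notebook builds the same quantity with the Sum primitive):

```python
def cost(w1, w2, w0, X, y):
    # Sum of squared differences between desired and computed outputs,
    # with the bias w0 held fixed (its dummy input is the constant -1).
    total = 0.0
    for (x1, x2), d in zip(X, y):
        out = -w0 + w1 * x1 + w2 * x2
        total += (d - out) ** 2
    return total

# Data generated from known weights, so the cost vanishes exactly there.
w0, w1, w2 = 0.2, 1.0, -0.5
X = [(0.5, -0.2), (-0.8, 0.4), (0.1, 0.9), (0.7, 0.7)]
y = [-w0 + w1 * x1 + w2 * x2 for x1, x2 in X]
print(cost(w1, w2, w0, X, y))   # 0.0 at the generating weights
```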
[Contour plot of the error function over the (w1, w2) plane.]
:[font = text; inactive; preserveAspect]
The package "PlotField" (Version 2.0 only) lets us plot the gradient as little arrows. The plot confirms that this function, being quadratic, has a simple minimum. Therefore a minimization procedure continually following the negative of the gradient vector will lead to the optimal weights.
[Plot of the gradient vector field over the (w1, w2) plane.]
:[font = text; inactive; preserveAspect]
Starting at some arbitrary initial weight value, the Linear Unit can be driven to the correct weights by following the gradient. This method is known as Steepest Descent.
This procedure will always converge if the step size is sufficiently small. The method is also an effective way to carry out Pseudo-Inverse calculations while avoiding explicit matrix inversions.
We first enter a definition for the gradient function
We can now apply the steepest descent rule, starting from arbitrary weight values and moving against the gradient until a minimum is reached:
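A Python sketch of the batch rule (the gradient of the summed squared error for a linear unit; the toy data set, step size, and iteration count are assumptions):

```python
def gradient(w, X, y):
    # Gradient of the summed squared error E = sum_k (d_k - w.x_k)^2:
    #   dE/dw = -2 * sum_k (d_k - w.x_k) * x_k
    g = [0.0] * len(w)
    for x, d in zip(X, y):
        err = d - sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            g[i] -= 2.0 * err * xi
    return g

# Toy data generated from known weights (extended inputs, dummy -1 first).
w_true = [0.3, -0.7, 0.5]
X = [[-1.0, 0.2, 0.9], [-1.0, -0.4, 0.1], [-1.0, 0.8, -0.6], [-1.0, 0.5, 0.5]]
y = [sum(wi * xi for wi, xi in zip(w_true, x)) for x in X]

w = [0.0, 0.0, 0.0]   # arbitrary starting point
eta = 0.05            # step size; must be small enough for stability
for _ in range(2000):
    g = gradient(w, X, y)
    w = [wi - eta * gi for wi, gi in zip(w, g)]   # move against the gradient
```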
The batch gradient descent method converges directly to the minimum. In the next section, we will look at what happens when a correction is made after each data presentation, following what is known as the Widrow-Hoff rule or "Delta" rule.
Approximating the Global Steepest Descent with Local Estimates: The Widrow-Hoff Rule
:[font = text; inactive; preserveAspect]
We will now show that under reasonable circumstances, the Linear Unit can indeed progressively learn to approach the correct weights on the basis of input-output pairs presented one at a time.
This sequential approach has been known for a long time as the Stochastic Approximation method (Robbins and Monro, 1951) and is known as the Widrow-Hoff rule in Neural Network circles. The gradient computed piecemeal is an unbiased estimator of the actual gradient. Convergence is no longer guaranteed; it requires in particular that the input samples be presented with approximately equal probability. The gain also cannot be too large, and can be progressively reduced as the minimum is approached.
We will make successive corrections to the weightVector (starting at some arbitrary value, say, at {w1Up, w2Up}), on the basis of the difference (error) observed locally between the given (desired) output for the current input and the computed output.
The Widrow-Hoff learning rule will work best if the input vectors are statistically independent and applied in random order, i.e., when the cost function is searched with maximum variety and in an unbiased way.
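A sketch of the rule in Python, with noiseless data presented in shuffled order (the data set, gain, and epoch count are illustrative assumptions):

```python
import random

random.seed(2)
# Noiseless training pairs generated from known weights (bias first).
w_true = [0.3, -0.7, 0.5]
data = []
for _ in range(50):
    x = [-1.0, random.uniform(-1, 1), random.uniform(-1, 1)]
    d = sum(wi * xi for wi, xi in zip(w_true, x))
    data.append((x, d))

w = [0.0, 0.0, 0.0]
eta = 0.1             # gain; too large a value would prevent convergence
for epoch in range(200):
    random.shuffle(data)   # unbiased presentation order, one pair at a time
    for x, d in data:
        err = d - sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]   # Widrow-Hoff update
```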
To test this statement, the reader is invited to change the data set and see what happens when a sizeable covariance is introduced between the (x1, x2) input pairs.
Generalization to Non-Linear Units: The Delta Rule
:[font = text; inactive; preserveAspect]
We can apply the same successive approximation method to nonlinear units having a smooth output transfer function (otf). Sigmoid functions have special advantages and are used extensively in Neural Network units.
Q: Why is the Sigmoid often chosen for the transfer function?
A: It is a monotonic function bounded from above and below (this property is also exhibited by the depolarization function of live neurons). Hence it has self-normalizing properties. In addition, and most importantly, it works neatly as a classifying function by providing a transition of adjustable sharpness between the two classes. At the limit, it becomes a step function and the unit becomes a true threshold unit, whose outputs are binary. If the inputVectors are also binary, a threshold unit implements a Boolean function.
Q: What is a mathematical expression for the Sigmoid?
A: Different forms are used. A good one is the hyperbolic tangent, which is bounded by -1 and +1. It is usually calculated explicitly in terms of the Exp function.
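Written out in Python for illustration, the hyperbolic tangent expressed through Exp is:

```python
import math

def sigmoid_tanh(net):
    # Hyperbolic tangent written explicitly with Exp:
    #   tanh(s) = (e^s - e^-s) / (e^s + e^-s), bounded by -1 and +1.
    e_pos, e_neg = math.exp(net), math.exp(-net)
    return (e_pos - e_neg) / (e_pos + e_neg)

print(sigmoid_tanh(0.0))    # 0.0 at the threshold
print(sigmoid_tanh(10.0))   # ~ +1 in saturation
```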
Q: Can the response of such a unit be visualized?
A: Yes, it can be simply visualized if we limit the number of inputs to two. Then the output is a surface over the {x1,x2} plane. First we define the unit output to be the sigmoid of a weighted sum over the inputVectors, including the bias input.